Performing a test of memory components with fault tolerance

ABSTRACT

An indication that a test resource of a test platform has failed can be received. The test resource can be associated with performing a portion of a test of memory components. A characteristic of the test resource that failed can be determined. Another test resource of the test platform can be identified based on the characteristic of the test resource that failed. The portion of the test of memory components can be performed based on the other test resource of the test platform.

TECHNICAL FIELD

The present disclosure generally relates to a memory sub-system, and more specifically, relates to performing a test of memory components with fault tolerance for the memory components of memory sub-systems.

BACKGROUND

A memory sub-system can be a storage system, such as a solid-state drive (SSD) or a hard disk drive (HDD). A memory sub-system can be a memory module, such as a dual in-line memory module (DIMM), a small outline DIMM (SO-DIMM), or a non-volatile dual in-line memory module (NVDIMM). A memory sub-system can include one or more memory components that store data. The memory components can be, for example, non-volatile memory components and volatile memory components. In general, a host system can utilize a memory sub-system to store data at the memory components and to retrieve data from the memory components.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various implementations of the disclosure.

FIG. 1 illustrates an example environment to perform a test of memory components with fault tolerance of a portion of the test in accordance with some embodiments of the present disclosure.

FIG. 2 is a flow diagram of an example method to perform a test of memory components in accordance with some embodiments of the present disclosure.

FIG. 3 is a flow diagram of an example method to perform a test of memory components by initiating a portion of the test at a new test resource based on a failure of another test resource in accordance with some embodiments.

FIG. 4A illustrates the allocation of test resources to perform a test of memory components in accordance with some embodiments of the present disclosure.

FIG. 4B illustrates the test of memory components where a portion of the test has failed in accordance with some embodiments of the present disclosure.

FIG. 4C illustrates the test of memory components where the portion of the test that failed has been restarted at a new test resource in accordance with some embodiments of the present disclosure.

FIG. 5 is a flow diagram of an example method to perform a test of memory components based on dependences between test resources in accordance with some embodiments.

FIG. 6 is a block diagram of an example computer system in which implementations of the present disclosure can operate.

DETAILED DESCRIPTION

Aspects of the present disclosure are directed to performing a test of memory components with fault tolerance. A memory sub-system is also hereinafter referred to as a “memory device.” An example of a memory sub-system is a storage device that is coupled to a central processing unit (CPU) via a peripheral interconnect (e.g., an input/output bus, a storage area network). Examples of storage devices include a solid-state drive (SSD), a flash drive, a universal serial bus (USB) flash drive, and a hard disk drive (HDD). Another example of a memory sub-system is a memory module that is coupled to the CPU via a memory bus. Examples of memory modules include a dual in-line memory module (DIMM), a small outline DIMM (SO-DIMM), a non-volatile dual in-line memory module (NVDIMM), etc. The memory sub-system can be a hybrid memory/storage sub-system. In general, a host system can utilize a memory sub-system that includes one or more memory components. The host system can provide data to be stored at the memory sub-system and can request data to be retrieved from the memory sub-system.

The memory components that are used in a memory sub-system can be tested before being utilized in the memory sub-system. In a conventional test process, the memory components can be placed into a chamber (i.e., an oven) that tests the memory components under various temperature conditions. For example, a single chamber can be used to test multiple memory components at a single time at a particular temperature. The test process can instruct various operations to be performed at the memory components at the particular temperature. Such operations can include, but are not limited to, read operations, write operations, and erase operations. The performance and behavior of the memory components can be observed while the test process is performed. For example, performance characteristics (e.g., read or write latencies) and reliability of data stored at the memory components can be measured and recorded during the test process. However, since the chamber can only apply a single temperature to the memory components at any particular time, the testing of the memory components at many different temperatures can require a large amount of time as the test process will need to be performed for each desired temperature. Additionally, the chamber can only perform a single test process at a time. As such, performing different tests of the memory components at different operating conditions (e.g., different temperatures) can utilize a large amount of time if many different conditions of the test process for the memory components are desired.

Aspects of the present disclosure address the above and other deficiencies by performing a test of memory components with fault tolerance of a portion of the test. The test of the memory components can be performed by a distributed test platform that includes multiple test resources. Each test resource of the test platform can be a test socket that includes a memory component that can be utilized by the test and a temperature control component that is used to apply a particular temperature condition to the memory component as part of the test. The test platform can further include multiple test boards that each include one or more of the test sockets. The test boards can be organized into groups or racks and multiple racks can be at a particular location or site. As such, the test platform can include multiple test sockets.

The test can specify to use a number of the test sockets that include memory components along with a sequence of operations that are to be performed with the tested memory components and the temperature condition that is to be applied to the memory components during the test. Furthermore, the test sockets that are used to perform the test can be distributed throughout multiple test boards, test racks, and/or locations of the test platform. Thus, for a particular test of memory components, the memory components used in the test can be embedded at different test sockets at different locations. If a particular test socket fails (i.e., malfunctions) during the operation of the test of the memory components, then the portion of the test that was being performed by the failed test socket can be incomplete. For example, data specifying the performance characteristics, reliability of data, or other such observations of the memory component being tested at the failed test socket can be incomplete. In response to the test socket failing, the test platform can assign a new test socket to be allocated for use by the test in order to complete the portion of the test that was being performed by the failed test socket. For example, the test platform can identify another memory component that is available at an unused test socket. Such a test socket can be referred to as an available test socket, which is a test socket that is not currently being used by the test of memory components or by another separate test of memory components that is concurrently being performed at the test platform. The test platform can determine whether another memory component at an available test socket matches characteristics of the memory component that was included in the test socket that failed. Such characteristics can be prior usage characteristics, design information, or other such information of a memory component.

If the other memory component at the available test socket matches the characteristics of the memory component at the failed test socket, then the test platform can initiate the portion of the test at the new test socket with the other memory component. For example, the portion of the test that was being performed at the failed test socket can be replicated at the new test socket. The operations and temperature condition can then be applied to the new test socket to complete the portion of the test. Subsequently, the test platform can combine the results of the portion of the test from the failed test socket (e.g., before the failure occurred) and the new test socket. In some embodiments, the new test socket can perform the portion of the test and the results can be retrieved from the new test socket without retrieving any data from the failed test socket.

In some embodiments, the test platform can provide an alert notification when a test socket has failed. For example, when a test socket has failed, the test rack that includes the test board with the failed test socket can transmit a notification that indicates that the test socket has failed. The failure of the test socket can result in a pausing of the portion of the test that was being performed by the failed test socket. Once the failure of the test socket has been resolved (e.g., the test socket or test board is replaced), then the test platform can resume the performance of the portion of the test at the replaced test socket.

The test platform can further pause a first portion of the test at a first test socket based on a failure of a second test socket. The first portion of the test can be dependent on a result of a second portion of the test that was being performed at the second test socket. For example, the first portion of the test can specify that particular operating conditions (e.g., operations to be performed and/or an applied temperature condition) are to be applied at the memory component of the first test socket based on the results of the second test socket. If the second test socket fails, then the test platform can also pause the first test socket as the first test socket is to operate based on results of the second test socket. Subsequently, the test platform can resume the first portion of the test when the failure of the second test socket has been resolved.

Advantages of the present disclosure include, but are not limited to, a decrease in the amount of time that is used to perform tests of the memory components. For example, the testing of the memory components can utilize a period of time, and if a particular portion of the test platform fails while performing the test, the test platform can pause the portion of the test that failed while other portions of the test at other portions of the test platform continue to operate. As a result, the entire test of memory components does not need to be restarted if a portion of the test fails at one of the memory components. Thus, the resources of the test platform can remain available to perform additional tests of memory components as opposed to repeating portions of a test of memory components that had failed during execution of the test.

FIG. 1 illustrates an example environment to allocate test resources to perform a test of memory components in accordance with some embodiments of the present disclosure. A test platform 100 can include one or more racks 110A, 110B, and 110N. Each of the racks 110A, 110B, and 110N can include multiple test boards 120 where each test board 120 includes one or more test sockets (i.e., test resources). The test platform 100 can include any number of racks or test sockets.

As shown, a test board 120 can include one or more test sockets. For example, a test board 120 can include a first test socket 121, a second test socket 122, and a third test socket 123. Although three test sockets are shown, a test board 120 can include any number of test sockets. Each test socket can include a memory component that has been embedded within the respective test socket. Additionally, each test socket can include a temperature control component that is used to apply a temperature condition to the embedded memory component. In some embodiments, the temperature control component can be a dual Peltier device (e.g., two Peltier devices) that utilizes a Peltier effect to apply a heating or cooling effect at a surface of the dual Peltier device that is coupled to the embedded memory component. In the same or alternative embodiments, the temperature control component can be placed on top of the memory component in the respective test socket.

As shown, each test rack 110A, 110B, and 110N can include multiple test boards 120. Each of the test boards 120 of a particular test rack can be coupled with a local test component. For example, each test rack 110A, 110B, and 110N can respectively include a local test component 111A, 111B, and 111N. Each of the local test components 111A, 111B, and 111N can receive instructions to perform a test or a portion of a test that is to be performed at the test sockets of the respective test rack. For example, a resource allocator component 132 can receive (e.g., from a user) conditions of the test that is to be performed and the resource allocator component 132 can determine particular test sockets across the different test boards 120 at one or more of the test racks 110A, 110B, and 110N that can be used by the test. In some embodiments, the resource allocator component 132 can be provided by a server 131. In some embodiments, the server 131 is a computing device or system that is coupled with the local test components 111A, 111B, and 111N over a network.

The temperature control component of each test socket 121, 122, and 123 of each test board 120 can be used to apply a different temperature condition to the respective embedded memory component. Furthermore, each test socket 121, 122, and 123 can be used to perform different operations at the embedded memory components.

The resource allocator component 132 can receive a test input from a user. The test input can specify conditions of the test that is to be performed with one or more memory components. For example, the test can specify particular temperature conditions that are to be applied to memory components and a sequence of operations that are to be performed at memory components under particular temperature conditions. The resource allocator component 132 can retrieve a data structure or database 133 that includes test resource data that identifies available test sockets across the test platform 100 as well as characteristics of the available test sockets. The database 133 can include usage characteristics and design information of the memory components that can be used to assign the test resources to a test. The resource allocator component 132 can assign test sockets at the test platform 100 that include embedded memory components that match or satisfy the conditions of the test. The resource allocator component 132 can then transmit instructions to local test components of test racks that include test sockets that are to be used in the test. Additionally, the resource allocator component 132 can receive test results from the different test sockets.
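
As a non-limiting illustration, the assignment of available test sockets from the test resource data could be sketched as follows. The record layout, the field names, and the function allocate_sockets are hypothetical and are used only for this example; the selection criteria (component type and a ceiling on prior program-erase cycles) are assumed stand-ins for the conditions of a test.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TestSocket:
    """Hypothetical record for one test socket in the test resource data."""
    socket_id: str
    rack: str
    board: str
    component_type: str               # design or manufacturing revision
    pe_cycles: int                    # prior program-erase cycles of the component
    in_use_by: Optional[str] = None   # name of the test using the socket, if any


def allocate_sockets(sockets: List[TestSocket], test_name: str, count: int,
                     component_type: str, max_pe_cycles: int) -> List[TestSocket]:
    """Assign `count` available sockets whose components satisfy the test conditions."""
    candidates = [s for s in sockets
                  if s.in_use_by is None
                  and s.component_type == component_type
                  and s.pe_cycles <= max_pe_cycles]
    if len(candidates) < count:
        raise ValueError("not enough matching test sockets available")
    chosen = candidates[:count]
    for socket in chosen:
        socket.in_use_by = test_name   # mark the socket as used by this test
    return chosen
```

In this sketch, a socket that is already used by any test is never a candidate, which mirrors the notion of an available test socket described above.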

The test platform 100 can further include a fault tolerance component 130 that is used to provide fault tolerance for a test that is being performed at the test resources of the test platform 100. For example, the fault tolerance component 130 can receive an indication of a test resource (i.e., a test socket) that has failed and has become unable to complete a portion of a test being performed at various test resources of the test platform 100. The fault tolerance component 130 can pause the test or restart the portion of the test at another test resource based on the indicated failure. Further details with respect to the fault tolerance component 130 are described below.

FIG. 2 illustrates an example method 200 to perform a test of memory components in accordance with some embodiments of the present disclosure. The method 200 can be performed by processing logic that can include hardware (e.g., processing device, circuitry, dedicated logic, programmable logic, microcode, hardware of a device, integrated circuit, etc.), software (e.g., instructions run or executed on a processing device), or a combination thereof. In some embodiments, the method 200 is performed by the fault tolerance component 130 of FIG. 1. Although shown in a particular sequence or order, unless otherwise specified, the order of the processes can be modified. Thus, the illustrated embodiments should be understood only as examples, and the illustrated processes can be performed in a different order, and some processes can be performed in parallel. Additionally, one or more processes can be omitted in various embodiments. Thus, not all processes are required in every embodiment. Other process flows are possible.

As shown, at operation 210, the processing logic determines test resources of a test platform that are performing a test of memory components. For example, a data structure can be retrieved where the data structure identifies the test resources that are present at the test platform. The data structure can specify each test resource, characteristics of memory components at each test resource, locations of the test resources, indications of whether particular test resources are being used by a particular test and whether other test resources are being used by other tests of memory components at the test platform. At operation 220, the processing logic receives an indication that a particular test resource of the test platform that has been performing a portion of the test of memory components has failed. A test resource can be considered to fail when the test resource cannot complete a portion of a test that has been scheduled to be performed or was being performed at the test resource. For example, the test resource can fail if a test board that includes the test resource becomes unable to transmit operations to be performed at the memory component. Thus, the test resource can be considered to fail if the test board that includes the test resource has failed. The test resource can additionally be considered to fail if the memory component embedded within the test resource becomes corrupted or if the temperature control element of the test resource is not capable of applying the requested temperature condition to the memory component included in the test resource.
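
As a non-limiting illustration, the failure conditions described above could be expressed as a single check. The SocketStatus fields and the has_failed function are hypothetical names, and the temperature tolerance of 2.0 degrees is an assumed value chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class SocketStatus:
    """Hypothetical status snapshot reported for one test resource."""
    board_reachable: bool       # the test board can still transmit operations
    component_corrupted: bool   # the embedded memory component is corrupted
    requested_temp_c: float     # temperature condition requested by the test
    measured_temp_c: float      # temperature actually applied at the component


def has_failed(status: SocketStatus, tolerance_c: float = 2.0) -> bool:
    """Return True if the test resource cannot complete its portion of the test."""
    if not status.board_reachable:
        return True   # a failed test board fails every test resource on it
    if status.component_corrupted:
        return True
    # Assumed criterion: the temperature control element is treated as failed
    # when it misses the requested temperature by more than the tolerance.
    return abs(status.measured_temp_c - status.requested_temp_c) > tolerance_c
```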

As shown in FIG. 2, at operation 230, the processing logic performs a remaining portion of the test of the memory components based on the indication that the particular test resource has failed during the test. For example, the portion of the test that was being performed at the failed test resource can be paused while other portions of the test being performed at other test resources can still be performed. In some embodiments, another test resource that is not currently being used by another test and that matches characteristics of the memory component of the failed test resource can be used to perform the remaining portion of the test. Further details with respect to performing the remaining portion of the test at another test resource are described in conjunction with FIG. 3. In some embodiments, the failure of the test resource can result in a pausing of another test resource until the failed test resource has been replaced. For example, another test resource can be paused if the other test resource is dependent on results of the failed test resource. Further details with respect to pausing a test resource based on a dependency are described in conjunction with FIG. 5.

FIG. 3 is a flow diagram of an example method 300 to perform a test of memory components by initiating a portion of the test at a new test resource based on a failure of another test resource in accordance with some embodiments. The method 300 can be performed by processing logic that can include hardware (e.g., processing device, circuitry, dedicated logic, programmable logic, microcode, hardware of a device, integrated circuit, etc.), software (e.g., instructions run or executed on a processing device), or a combination thereof. In some embodiments, the method 300 is performed by the fault tolerance component 130 of FIG. 1. Although shown in a particular sequence or order, unless otherwise specified, the order of the processes can be modified. Thus, the illustrated embodiments should be understood only as examples, and the illustrated processes can be performed in a different order, and some processes can be performed in parallel. Additionally, one or more processes can be omitted in various embodiments. Thus, not all processes are required in every embodiment. Other process flows are possible.

As shown, at operation 310, the processing logic receives an indication that a test resource of a test platform that has been performing a portion of a test of memory components has failed. For example, a test board that includes the test resource or a test rack that includes the test board can transmit a notification that the test resource has failed. The test resource can be a single test resource out of multiple test resources that are being used to perform the test of memory components. In some embodiments, the notification can identify the particular test that has failed and can identify a location of the test resource that failed. For example, when a user provides inputs to have a test performed at the test platform, the user can provide a name or other such identification for the test. The notification can thus include the identification of the test. Furthermore, the notification can specify the location, test rack, and test board that include the specific test resource (i.e., test socket) that failed.

At operation 320, the processing logic identifies another test resource of the test platform that matches characteristics of the test resource that failed. The characteristics of the test resource can be usage characteristics of the memory component that is included in the failed test resource. For example, the usage characteristics can specify a number of operations that have been performed on the memory component of the test resource that failed. In some embodiments, the usage characteristics can specify a number of program-erase operations or cycles and a particular number of read operations that have been performed on the memory component during the test and any prior tests. In some embodiments, the usage characteristics can specify the prior temperature conditions that have been applied to the memory component during the use of the memory component in the test and prior tests. For example, the prior temperatures at which operations have been performed at the memory component for prior tests can be specified (i.e., a temperature profile of the memory component). Thus, the characteristics can be based on a usage history of the memory component. In some embodiments, the characteristics can specify a type of the memory component that is included in the failed test resource. For example, particular versions (i.e., design or manufacturing revisions) of the memory component can be specified.

The matching test resource can be a test resource of the test platform that is not currently being used by any test at the test platform. For example, the matching test resource can be a test resource with a memory component that matches the characteristics of the memory component included in the failed test resource and is not currently being used by the test or another test at the test platform.
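
As a non-limiting illustration, identifying a matching and unused test resource could be sketched as follows. The names ComponentProfile, Socket, matches, and find_replacement are hypothetical. The matching rule used here (identical component type, identical set of prior temperatures, and usage counts within a 10% tolerance) is an assumption made for the example, since the degree of similarity that constitutes a match can vary between embodiments.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ComponentProfile:
    """Hypothetical characteristics of a memory component."""
    component_type: str                # design or manufacturing revision
    pe_cycles: int                     # program-erase cycles performed so far
    read_ops: int                      # read operations performed so far
    temperatures_c: Tuple[int, ...]    # prior temperature conditions applied


@dataclass
class Socket:
    socket_id: str
    profile: ComponentProfile
    in_use: bool = False


def matches(a: ComponentProfile, b: ComponentProfile, tolerance: float = 0.10) -> bool:
    """Assumed rule: same type, same temperature profile, similar usage counts."""
    if a.component_type != b.component_type:
        return False
    if set(a.temperatures_c) != set(b.temperatures_c):
        return False
    for x, y in ((a.pe_cycles, b.pe_cycles), (a.read_ops, b.read_ops)):
        if abs(x - y) > tolerance * max(x, y, 1):
            return False
    return True


def find_replacement(failed: ComponentProfile, sockets: List[Socket]) -> Optional[Socket]:
    """Return the first unused socket whose component matches the failed one."""
    for socket in sockets:
        if not socket.in_use and matches(failed, socket.profile):
            return socket
    return None
```

When no socket matches, find_replacement returns None, in which case the portion of the test could remain paused until the failed test resource is replaced.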

As shown in FIG. 3, at operation 330, the processing logic performs a remaining portion of the test of memory components at the other test resource of the test platform. For example, the portion of the test that was not completed as a result of the failure of the test resource can be performed at the other test resource. In some embodiments, when the test resource fails, the test resource (or the test board that includes the test resource) can provide an indication of the portion of the test allocated to the test resource that had completed and the remaining portion of the test allocated to the test resource that had not completed. For example, the portion of the test allocated to the test resource can be to perform a sequence of operations on the included memory component at one or more temperatures. The indication can identify which of the operations at requested temperatures have been performed and which of the operations at requested temperatures have not been performed. For example, the indication can identify the last operation in the sequence of operations allocated to the test resource that has been performed. The subsequent operations in the sequence of operations can then be performed at the requested one or more temperatures at the memory component included in the other test resource.
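
As a non-limiting illustration, determining the remaining operations from an indication of the last performed operation could be sketched as follows. The function name and the representation of an operation as an (operation, temperature) pair are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: (operation name, requested temperature in deg C).
Operation = Tuple[str, int]


def remaining_operations(sequence: List[Operation],
                         last_completed_index: int) -> List[Operation]:
    """Return the operations that follow the last operation that was performed.

    A value of -1 means the failed test resource completed no operations.
    """
    if not -1 <= last_completed_index < len(sequence):
        raise ValueError("last_completed_index is outside the sequence")
    return sequence[last_completed_index + 1:]


sequence = [("write", 25), ("read", 25), ("erase", 85), ("read", 85)]
# The failed test resource reported that index 1 was the last operation performed.
to_run_at_new_resource = remaining_operations(sequence, 1)
```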

At operation 340, the processing logic receives the results from the test resource and the other test resource for the test of memory components. For example, the results of the portion of the test that were performed by the failed test resource and the results of the remaining portion of the test that were performed by the other test resource can be received from the respective test resources. At operation 350, the processing logic combines the results from the test resource and the other test resource for the portion of the test of the memory components. For example, the results of the portion of the sequence of operations performed at the test resource before failure can be combined with the remaining portion of the sequence of operations performed at the other test resource.
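
As a non-limiting illustration, combining the results received from the failed test resource and the other test resource could be sketched as follows. The results are assumed to be keyed by the index of the operation in the sequence, and the function name combine_results is hypothetical. Preferring the result from the other test resource when both report the same operation is an assumed policy for the example.

```python
from typing import Any, Dict


def combine_results(failed_results: Dict[int, Any],
                    replacement_results: Dict[int, Any]) -> Dict[int, dict]:
    """Merge per-operation results into one record ordered by operation index."""
    combined: Dict[int, dict] = {}
    # The replacement is processed last so that it overrides any operation
    # that both test resources reported (assumed policy).
    for source, results in (("failed", failed_results),
                            ("replacement", replacement_results)):
        for step, measurement in results.items():
            combined[step] = {"source": source, "measurement": measurement}
    return dict(sorted(combined.items()))
```

Recording the source of each measurement keeps it possible to tell which memory component produced which part of the combined result.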

In some embodiments, the portion of the test that was performed by the failed test resource can be performed from the beginning at the new test resource (i.e., the other test resource). For example, the entire sequence of operations can be performed at the new test resource at the one or more requested temperatures. Thus, the sequence of operations that was to be performed at the failed test resource can be performed at the new test resource.

FIG. 4A illustrates the allocation of test resources to perform a test of memory components in accordance with some embodiments of the present disclosure. The test resources can be allocated by processing logic that can include hardware (e.g., processing device, circuitry, dedicated logic, programmable logic, microcode, hardware of a device, integrated circuit, etc.), software (e.g., instructions run or executed on a processing device), or a combination thereof. In some embodiments, the test resources can be allocated by the fault tolerance component 130 of FIG. 1.

As shown, a test of memory components can be performed at a test platform that includes test racks 410 and 420. For example, the test resources 411 with the checkmark can indicate test resources that are currently being used to perform the test of the memory components. The test resources without the checkmark indicate test resources that are currently available and not being used by any test at the test platform.

FIG. 4B illustrates the test of memory components where a portion of the test has failed in accordance with some embodiments of the present disclosure. As shown, a failure of the test resources 412, 413, and 414 can occur. In some embodiments, the failure of the test resources 412, 413, and 414 can be caused by the test board that includes the test resources 412, 413, and 414 failing as a result of a physical malfunction of the test board, a network connection of the test board failing, etc. The failure of the test board can thus result in a failure of each of the test resources 412, 413, and 414 that are located on the test board. In some embodiments, an individual test resource can fail. For example, a particular test resource can be considered to fail if the temperature control element of the test resource cannot apply the requested temperature condition to the included memory component of the test resource. The other test resources (e.g., as indicated by the checkmark) that were performing other portions of the test can still be performing the test.

FIG. 4C illustrates the test of memory components where the portion of the test that failed has been restarted at a new test resource in accordance with some embodiments of the present disclosure. As shown, the portion of the test that was being performed at the failed test resources 412, 413, and 414 can be restarted at new test resources 423, 424, and 425. The new test resources 423, 424, and 425 can perform the sequence of operations that was allocated to the failed test resources 412, 413, and 414 or can perform the remaining operations of the sequence of operations that had not been performed when the test resources 412, 413, and 414 had failed.

In some embodiments, the new test resources can be selected to be in a new test rack that is separate from the test rack that includes the failed test resources. In some embodiments, the new test resources can be selected based on a location of the failed test resources. For example, the new test resources can be test resources that match the failed test resources and that are at the same test racks as, or at locations closer to, the failed test resources.
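
As a non-limiting illustration, selecting among matching test resources based on location could be sketched as follows. A location is assumed to be a (site, rack, board) tuple, the three-level notion of distance is an assumption made for the example, and the function name rank_candidates is hypothetical. The exclude_same_rack option corresponds to embodiments in which the new test resources are selected from a separate test rack.

```python
from typing import List, Tuple

Location = Tuple[str, str, str]     # hypothetical (site, rack, board)
Candidate = Tuple[str, Location]    # (socket identifier, location)


def rank_candidates(failed_location: Location,
                    candidates: List[Candidate],
                    exclude_same_rack: bool = False) -> List[Candidate]:
    """Order matching candidates from closest to farthest from the failure."""
    def distance(location: Location) -> int:
        same_site = location[0] == failed_location[0]
        same_rack = same_site and location[1] == failed_location[1]
        if same_rack:
            return 0
        if same_site:
            return 1
        return 2

    # sorted() is stable, so candidates at equal distance keep their input order.
    ranked = sorted(candidates, key=lambda c: distance(c[1]))
    if exclude_same_rack:
        ranked = [c for c in ranked if distance(c[1]) != 0]
    return ranked
```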

FIG. 5 is a flow diagram of an example method 500 to perform a test of memory components based on dependences between test resources in accordance with some embodiments. The method 500 can be performed by processing logic that can include hardware (e.g., processing device, circuitry, dedicated logic, programmable logic, microcode, hardware of a device, integrated circuit, etc.), software (e.g., instructions run or executed on a processing device), or a combination thereof. In some embodiments, the method 500 is performed by the fault tolerance component 130 of FIG. 1. Although shown in a particular sequence or order, unless otherwise specified, the order of the processes can be modified. Thus, the illustrated embodiments should be understood only as examples, and the illustrated processes can be performed in a different order, and some processes can be performed in parallel. Additionally, one or more processes can be omitted in various embodiments. Thus, not all processes are required in every embodiment. Other process flows are possible.

As shown, at operation 510, the processing logic receives an indication that a first test resource of a test platform that has been performing a portion of a test of memory components has failed. For example, the indication can identify that the first test resource cannot complete the sequence of operations that have been allocated to the first test resource. The indication can identify a particular operation of the sequence of operations that was the last operation that the first test resource had performed. At operation 520, the processing logic determines whether a second test resource of the test platform that has been performing another portion of the test of the memory components is dependent on the first test resource. A dependency between the first test resource and the second test resource can be that the performance of the portion of the test at the second test resource is based on one or more results of the first test resource. For example, the second test resource can specify that a particular operation (e.g., a read operation, write operation, or erase operation) at a particular temperature is to be performed based on how the memory component of the first test resource performs a defined operation at a defined temperature. Thus, a particular operation to be performed and/or a particular temperature condition to be applied at the second test resource can be based on a result of the behavior or operation of the memory component included at the first test resource. As such, the second test resource can be considered dependent upon the first test resource.
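
As a non-limiting illustration, determining which test resources to pause when a first test resource fails could be sketched as follows. The mapping depends_on and the function dependents_to_pause are hypothetical. Following dependencies transitively (pausing a resource that depends on an already paused resource) is an assumption made for the example; the description above addresses a direct dependency between two test resources.

```python
from collections import deque
from typing import Dict, Set


def dependents_to_pause(failed: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the test resources that depend, directly or indirectly, on `failed`.

    `depends_on` maps each test resource to the set of test resources whose
    results it needs in order to perform its portion of the test.
    """
    paused: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for resource, needs in depends_on.items():
            if current in needs and resource not in paused and resource != failed:
                paused.add(resource)
                queue.append(resource)   # its own dependents must pause as well
    return paused
```

Test resources that do not appear in the returned set can continue performing their portions of the test while the failure is resolved.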

In some embodiments, the dependency can be based on the results of a test at test resources. For example, various settings (i.e., trims) can be set or defined for a memory component where the different settings can influence the operation and behavior of the memory component. For example, a particular setting that is defined for the memory component can change the functionality, reliability, and performance of the memory component. The settings or trims of the memory component can be updated or changed. In some embodiments, a test can be performed on multiple test resources by iteratively modifying the settings or trims of the memory components at the test resources. For example, a first portion of the test can be performed at a first test resource with a memory component at a first setting and a second portion of the test can be performed at a second test resource with another memory component at a second setting, etc. The results from the test at each of the memory components can then be combined to determine new settings for the memory components that are to be applied at a next test or a subsequent portion of the test. Thus, a test resource can be dependent on another test resource when the results of each of the test resources are used to perform a subsequent test based on a setting or trim that is based on the results of the test performed at the test resources.
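One way to picture this kind of dependency is an iterative trim search in which the next round of settings cannot be computed until every test resource has reported. The sketch below is an assumption made for illustration only: the disclosure does not say how results are combined, so the rule used here (pick the trim with the fewest errors and probe one step either side of it), the integer trim values, and the function name `next_trims` are all hypothetical.

```python
from __future__ import annotations

from typing import Optional


def next_trims(results: dict[str, Optional[tuple[int, int]]],
               step: int) -> Optional[list[int]]:
    """Combine per-resource results into the trims for the next round.

    `results` maps a resource ID to (trim value, error count), or to None
    when that resource has not reported, for example because it failed.
    Returns None while any result is missing, which is what makes the
    resources dependent on one another.
    """
    if any(result is None for result in results.values()):
        return None
    # Hypothetical combination rule: keep the trim with the fewest errors
    # and probe one step on either side of it in the next round.
    best_trim, _ = min(results.values(), key=lambda result: result[1])
    return [best_trim - step, best_trim, best_trim + step]


complete = {"resource-1": (10, 5), "resource-2": (12, 2), "resource-3": (14, 7)}
incomplete = {"resource-1": None, "resource-2": (12, 2), "resource-3": (14, 7)}
```

With `complete`, the next round is centred on trim 12. With `incomplete`, no new settings can be produced until the failure of `resource-1` is resolved.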

In some embodiments, a test resource can be dependent on another test resource when a same type of test is being performed at the test resources. For example, the test can perform similar operations at similar conditions at test resources that include memory components with similar characteristics. If one of the test resources of the test detects an anomaly or a failure such as a type of behavior of the memory component that is unexpected, then another test resource performing the same test can be paused. In some embodiments, the test board with the test resource that detected the anomaly can transmit an interrupt or instruction to the other test resource to pause the portion of the test at the other test resource. In some embodiments, after the other test resource is paused, an action can be performed at the other test resource. Such an action can be a measurement of a characteristic or observation of the state of the other test resource before the sequence of events that caused the anomaly to appear at the prior test resource is performed at the other test resource.
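The anomaly-driven pause described above can be sketched as follows. The `PeerResource` record, its fields, and the choice of the next operation index as the observed state are hypothetical and stand in for whatever measurement a given embodiment would take.

```python
from dataclasses import dataclass


@dataclass
class PeerResource:
    """Hypothetical view of a test resource running some type of test."""
    resource_id: str
    test_type: str
    next_op_index: int = 0   # position in the sequence of operations
    paused: bool = False


def pause_peers_on_anomaly(source, resources):
    """Pause every other resource running the same type of test as `source`.

    Returns a snapshot (here, the next operation index) of each paused
    peer, taken before the peer reaches the operations that produced the
    anomaly at `source`.
    """
    snapshots = {}
    for peer in resources:
        if peer is source or peer.test_type != source.test_type:
            continue
        peer.paused = True
        snapshots[peer.resource_id] = peer.next_op_index
    return snapshots


a = PeerResource("resource-a", "endurance", next_op_index=40)
b = PeerResource("resource-b", "endurance", next_op_index=25)
c = PeerResource("resource-c", "retention", next_op_index=10)
snapshots = pause_peers_on_anomaly(a, [a, b, c])
```

Here `resource-b` is paused and observed because it runs the same type of test as `resource-a`, while `resource-c` continues.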

At operation 530, the processing logic pauses the other portion of the test at the second test resource based on the determination that the second test resource is dependent on the first test resource. For example, an instruction can be transmitted to the second test resource to pause the performance of the portion of the test that was being performed at the second test resource. The instruction can indicate that the first test resource that the second test resource is dependent upon has failed. In some embodiments, the instruction can indicate a particular operation of the sequence of operations at the second test resource where the second test resource is to be paused. For example, the instruction can indicate the point in the sequence of operations at which the second test resource is to pause and not perform subsequent operations of the sequence of operations. At operation 540, the processing logic receives an indication that the failure of the first test resource has been resolved. For example, the indication can identify that the first test resource has resumed the portion of the test allocated to the first test resource. In some embodiments, the resolving of the failure can be that the portion of the test allocated to the first test resource has been restarted or resumed at a new test resource as previously identified. In some embodiments, the indication can specify a difference between the first test resource and the new test resource that has replaced the first test resource. The difference can specify a difference between the usage characteristics of the first test resource and the new test resource. The second test resource can then use the difference between usage characteristics to adjust the performance of the test at the second test resource. For example, a different temperature can be applied or an increased number of operations can be performed at the second test resource. At operation 550, the processing logic resumes the other portion of the test at the second test resource after the failure of the first test resource has been resolved. For example, an indication can be transmitted to the second test resource to resume the remaining portion of the test allocated to the second test resource.
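Operations 510 through 550 can be sketched end to end as a small coordinator. This is an illustrative outline under stated assumptions, not the fault tolerance component 130 itself: the class and method names, the `uses_results_of` field, and the use of an extra operation count as the adjustment on resume are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical test resource tracked by the coordinator."""
    resource_id: str
    uses_results_of: set[str] = field(default_factory=set)
    paused: bool = False
    extra_operations: int = 0  # adjustment applied when resuming


class FaultToleranceCoordinator:
    def __init__(self, resources: list[Resource]) -> None:
        self.resources = {r.resource_id: r for r in resources}
        # Maps a failed resource ID to the IDs it caused to be paused.
        self.paused_by: dict[str, list[str]] = {}

    def on_failure(self, failed_id: str) -> list[str]:
        """Operations 510-530: receive the failure, find and pause dependents."""
        dependents = [r for r in self.resources.values()
                      if failed_id in r.uses_results_of]
        for r in dependents:
            r.paused = True
        self.paused_by[failed_id] = [r.resource_id for r in dependents]
        return self.paused_by[failed_id]

    def on_resolved(self, failed_id: str, extra_operations: int = 0) -> list[str]:
        """Operations 540-550: the failure is resolved, so resume dependents.

        `extra_operations` stands in for the usage-characteristic difference
        between the failed resource and its replacement.
        """
        resumed = self.paused_by.pop(failed_id, [])
        for resource_id in resumed:
            r = self.resources[resource_id]
            r.extra_operations += extra_operations
            r.paused = False
        return resumed


coordinator = FaultToleranceCoordinator([
    Resource("resource-1"),
    Resource("resource-2", uses_results_of={"resource-1"}),
    Resource("resource-3"),
])
paused = coordinator.on_failure("resource-1")
paused_state = coordinator.resources["resource-2"].paused
resumed = coordinator.on_resolved("resource-1", extra_operations=100)
```

Keeping the record of which failure paused which resources lets the coordinator resume only the dependents of the resolved failure, leaving any resources paused for other reasons untouched.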

FIG. 6 illustrates an example machine of a computer system 600 within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, can be executed. In some embodiments, the computer system 600 can correspond to a host or server system that includes, is coupled to, or utilizes a test platform (e.g., to execute operations corresponding to the fault tolerance component 130 of FIG. 1). In alternative embodiments, the machine can be connected (e.g., networked) to other machines in a LAN, an intranet, an extranet, and/or the Internet. The machine can operate in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment.

The machine can be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The example computer system 600 includes a processing device 602, a main memory 604 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 606 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage system 618, which communicate with each other via a bus 630.

Processing device 602 represents one or more general-purpose processing devices such as a microprocessor, a central processing unit, or the like. More particularly, the processing device can be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 602 can also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 602 is configured to execute instructions 626 for performing the operations and steps discussed herein. The computer system 600 can further include a network interface device 608 to communicate over the network 620.

The data storage system 618 can include a machine-readable storage medium 624 (also known as a computer-readable medium) on which is stored one or more sets of instructions 626 or software embodying any one or more of the methodologies or functions described herein. The instructions 626 can also reside, completely or at least partially, within the main memory 604 and/or within the processing device 602 during execution thereof by the computer system 600, the main memory 604 and the processing device 602 also constituting machine-readable storage media. The machine-readable storage medium 624, data storage system 618, and/or main memory 604 can correspond to a memory sub-system.

In one embodiment, the instructions 626 include instructions to implement functionality corresponding to a fault tolerance component (e.g., the fault tolerance component 130 of FIG. 1). While the machine-readable storage medium 624 is shown in an example embodiment to be a single medium, the term “machine-readable storage medium” should be taken to include a single medium or multiple media that store the one or more sets of instructions. The term “machine-readable storage medium” shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term “machine-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.

Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. The present disclosure can refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage systems.

The present disclosure also relates to an apparatus for performing the operations herein. This apparatus can be specially constructed for the intended purposes, or it can include a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program can be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems can be used with programs in accordance with the teachings herein, or it can prove convenient to construct a more specialized apparatus to perform the method. The structure for a variety of these systems will appear as set forth in the description below. In addition, the present disclosure is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages can be used to implement the teachings of the disclosure as described herein.

The present disclosure can be provided as a computer program product, or software, that can include a machine-readable medium having stored thereon instructions, which can be used to program a computer system (or other electronic devices) to perform a process according to the present disclosure. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer). In some embodiments, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium such as a read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory components, etc.

In the foregoing specification, embodiments of the disclosure have been described with reference to specific example embodiments thereof. It will be evident that various modifications can be made thereto without departing from the broader spirit and scope of embodiments of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

What is claimed is:
 1. A method comprising: receiving an indication that a test resource of a test platform has failed, wherein the test resource is associated with performing a portion of a test of memory components; determining a characteristic of the test resource that failed; identifying another test resource of the test platform based on the characteristic of the test resource that failed; and performing, by a processing device, the portion of the test of memory components based on the another test resource of the test platform.
 2. The method of claim 1, wherein the test resource comprises a memory component, and wherein the characteristic of the test resource corresponds to a usage characteristic of the memory component of the test resource.
 3. The method of claim 2, wherein the identifying of the another test resource of the test platform based on the characteristic of the test resource that failed comprises determining that a usage characteristic of another memory component of the another test resource matches the usage characteristic of the memory component of the test resource.
 4. The method of claim 1, wherein the test corresponds to a sequence of operations performed at the memory component, and wherein the performing of the portion of the test of memory components based on the another test resource comprises: determining a last performed operation of the sequence of operations that was performed at the test resource before the failure of the test resource; and performing one or more additional operations subsequent to the last performed operation of the sequence of operations at another memory component of the another test resource.
 5. The method of claim 1, further comprising: receiving results of the portion of the test of memory components performed at the test resource before the failure of the test resource; and receiving additional results of the portion of the test of memory components performed at the another test resource after the failure of the test resource.
 6. The method of claim 1, wherein the characteristic of the test resource indicates a number of prior operations that have been performed at the test resource.
 7. The method of claim 1, wherein the characteristic of the test resource indicates a prior temperature condition that has been applied to the test resource.
 8. A system comprising: a memory component; and a processing device, operatively coupled with the memory component, to: receive an indication that a first test resource of a test platform that has been performing a first portion of a test of a plurality of memory components has failed; determine whether a second test resource of the test platform that has been performing a second portion of the test of the plurality of memory components is dependent on the first test resource of the test platform; and in response to determining that the second test resource is dependent on the first test resource, pause a performance of the second portion of the test by the second test resource of the test platform.
 9. The system of claim 8, wherein to determine whether the second test resource is dependent on the first test resource, the processing device is further to: determine whether a result of the first portion of the test performed by the first test resource is used by the second test resource to perform the second portion of the test.
 10. The system of claim 9, wherein the result of the first portion of the test specifies a particular temperature condition that is to be applied to a particular memory component of the second test resource.
 11. The system of claim 9, wherein the result of the first portion of the test specifies a particular operation that is to be performed at a particular memory component of the second test resource.
 12. The system of claim 9, wherein the result of the first portion of the test corresponds to a behavior of a particular memory component included in the first test resource in response to a particular operation performed on the particular memory component at a particular temperature condition.
 13. The system of claim 8, wherein the processing device is further to: receive an indication that the failure of the first test resource has been resolved; and provide an instruction to resume the performance of the second portion of the test by the second test resource of the test platform.
 14. A non-transitory computer readable medium comprising instructions, which when executed by a processing device, cause the processing device to perform operations comprising: receiving an indication that a test resource of a test platform has failed, wherein the test resource is associated with performing a portion of a test of memory components; determining a characteristic of the test resource that failed; identifying another test resource of the test platform based on the characteristic of the test resource that failed; and performing the portion of the test of memory components based on the another test resource of the test platform.
 15. The non-transitory computer readable medium of claim 14, wherein the test resource comprises a memory component, and wherein the characteristic of the test resource corresponds to a usage characteristic of the memory component of the test resource.
 16. The non-transitory computer readable medium of claim 15, wherein the identifying of the another test resource of the test platform based on the characteristic of the test resource that failed comprises determining that a usage characteristic of another memory component of the another test resource matches the usage characteristic of the memory component of the test resource.
 17. The non-transitory computer readable medium of claim 14, wherein the characteristic of the test resource indicates a number of prior operations that have been performed at the test resource.
 18. The non-transitory computer readable medium of claim 14, wherein the test corresponds to a sequence of operations performed at the memory component, and wherein to perform the portion of the test of memory components based on the another test resource, the operations further comprise: determining a last performed operation of the sequence of operations that was performed at the test resource before the failure of the test resource; and performing one or more additional operations subsequent to the last performed operation of the sequence of operations at another memory component of the another test resource.
 19. The non-transitory computer readable medium of claim 14, wherein the characteristic of the test resource indicates a prior temperature condition that has been applied to the test resource.
 20. The non-transitory computer readable medium of claim 14, the operations further comprising: receiving results of the portion of the test of memory components performed at the test resource before the failure of the test resource; and receiving additional results of the portion of the test of memory components performed at the another test resource after the failure of the test resource.